-
-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache various statistics to improve performance #204
base: main
Are you sure you want to change the base?
Conversation
Order of operations was wrong before, causing incorrect outputs.
The implementation uses an AVL tree to keep track of the low and high parts of the input array, and then updates the trees accordingly in O(log n) time (instead of O(n) provided by numpy.median).
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #204 +/- ##
==========================================
- Coverage 98.47% 98.21% -0.27%
==========================================
Files 43 43
Lines 1839 2075 +236
Branches 114 129 +15
==========================================
+ Hits 1811 2038 +227
- Misses 25 31 +6
- Partials 3 6 +3
|
Regarding the median: I was not satisfied with the performance of pure Python implementations of sorted containers which have |
An implementation that resolves #201.
Describe your changes
This is a rather large-looking PR (only because I am basing this off of #198 for ease-of-merge!), and is a sort of a first attempt (which is why I am marking it as a draft) at caching the various statistics properties as described in #201.
TL;DR: since at each iteration we only move one point in the dataset (basically map$(x_i, y_i) \mapsto (x'_i, y'_i)$ ), why not compute the new statistics ($\langle X' \rangle$ , $\text{Var}(X') $ , and $r(X', Y') $ ) in terms of the old ones ($\langle X \rangle$ , $\text{Var}(X) $ , and $r(X, Y) $ ) + the old ($(x_i, y_i) $ ) and new ($(x'_i, y'_i) $ ) points? This is essentially what this PR does.
Details
Mathematical derivation
The mean is shifted as:
where$\delta_x = x'_i - x_i$ (same for $Y$ ), the variance (or the standard deviation if you want) is shifted as:
while the correlation coefficient is shifted as:
where the only new quantity to keep track of is$\langle X Y \rangle$ (the rest can be obtained from the above 2).
Workflow
With the math out of the way, this is how the new way of computing the statistics works:
(x, y)
dataset, and make an instance ofStatistics
(subject to change for its somewhat ambiguous naming) from itStatistics
has a method,perturb
, which takes in 3 arguments: the row (index) we wish to perturb, and the 2 values of the perturbations (in thex
andy
directions), and returns an instance of the newSummaryStatistics
(i.e. the one from the perturbed data)_is_close_enough
returnsTrue
, we call theperturb
method again (it'supdate=True
, which actually updates the data in theStatistics
instanceNote that there's a bunch of new methods (
shifted_mean
,shifted_var
,shifted_stdev
,shifted_corrcoef
), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status).Performance
Now, onto the performance results! Note that the benchmarks were done with #201 already in there, so the absolute numbers differ from the ones on
main
, but the reduction in the number of function calls is evident. I tested everything usingpython -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape]
.so we use 26.1M less function calls for all shapes in question, which results in about 35% faster performance (computed as (t_old - t_new) / t_old) * 100).
Discussion items
Some possible points of discussion:
shifted_*
functions private? Couldperturb
do without anupdate
argument? etc.)shifted_*
functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could just be that I haven't tried enough scenarios to hit an instability)Feedback is welcome :)
Checklist